Abstract

Soft errors arise from strikes of high-energy particles and have become a key challenge in microprocessor design. Designers 
require accurate estimates of error rates to make appropriate cost trade-offs. A key aspect of this estimation is the analysis of 
the architectural vulnerability factor (AVF). The AVF of a processor structure is defined as the probability that a fault in that 
structure will result in a visible error in the final output of a program, and it is one of the most commonly used 
metrics of a structure's vulnerability. In this paper, we describe the concept of AVF, the methods for computing it, and a 
comparison of three models, and we discuss directions for future research. 
Introduction

Soft errors caused by radiation, from neutrons in cosmic rays or alpha particles in packaging material, are 
becoming an increasing burden for microprocessor designers. The microprocessor is the core of many electronic 
systems, and a soft error, once it occurs, may cause serious damage. However, the approaches for protection 
introduce a significant penalty in performance, power, die size, and design time. Consequently, designers must 
carefully evaluate the soft error rate of a microprocessor to decide on the appropriate amount of protection 
necessary for a target market. 

If a particle strike causes a bit to flip or a piece of logic to generate a wrong result, we call the bit flip or the wrong 
result a raw soft error. Fortunately, not all raw soft errors cause the program to fail. For example, a soft error in a 
functional unit that is not currently processing an instruction or in an SRAM cell that is not storing useful data will 
not harm the execution. Such an error is said to be masked. Research has shown that there is a large masking effect 
at the architecture and micro-architecture levels [1, 2, 3, 6]; e.g., Wang et al. report more than 85% masking [3]. 
Because masked faults do not cause the program to fail, an important aspect of fault modeling is measuring the 
impact of this masking effect. 

Three common terms are often used to discuss system reliability: failure rate, mean time between failures (MTBF), 
and failures in time (FIT) [4]. MTBF and FIT are currently the two most commonly used units for error rates. For example, for 
its Power4 processor-based systems, IBM targets 1000 years system MTBF for silent data corruption (SDC) errors, 
25 years system MTBF for detected unrecoverable errors (DUE) that result in a system crash, and 10 years system 
MTBF for DUE errors that result in an application crash [4]. FIT is inversely related to MTBF. One FIT specifies one 
failure in a billion hours. Thus, a 1000-year MTBF corresponds to about 114 FIT (10^9 / (1000 x 8760) ≈ 114). The 
effective FIT rate per bit is influenced by several vulnerability factors. In general, a vulnerability factor indicates the 
probability that an internal fault in a device's operation will result in an externally visible error [6]. Based on whether a 
fault produces a visible system error, Mukherjee et al. introduced the architectural vulnerability factor (AVF) and used 
it to estimate system reliability. 
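
The conversion between MTBF and FIT can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative and a year is assumed to have 8760 hours (leap years ignored).

```python
HOURS_PER_YEAR = 24 * 365   # 8760 hours, leap years ignored (assumption)
FIT_BASE_HOURS = 1e9        # one FIT is one failure per billion hours


def mtbf_years_to_fit(mtbf_years: float) -> float:
    """Convert an MTBF given in years to a failure rate in FIT."""
    return FIT_BASE_HOURS / (mtbf_years * HOURS_PER_YEAR)


def fit_to_mtbf_years(fit: float) -> float:
    """Convert a failure rate in FIT back to an MTBF in years."""
    return FIT_BASE_HOURS / (fit * HOURS_PER_YEAR)


# The three MTBF targets quoted in the text above.
targets = [("SDC", 1000), ("DUE, system crash", 25), ("DUE, application crash", 10)]
for label, years in targets:
    print(f"{label}: {years} years MTBF = {mtbf_years_to_fit(years):.0f} FIT")
```

Because FIT rates of independent components add, while MTBF values do not, FIT is usually the more convenient unit when the contributions of several structures are combined.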

Computing AVF for all processor structures is a key aspect of such soft error analysis. This paper surveys the 
current methods for evaluating AVF. It first introduces the basic concept of AVF, then analyzes the evaluation 
methods, discusses their limitations and advantages, and suggests directions for future research. 
Architectural Vulnerability Factor and ACE Analysis


Architectural Vulnerability Factor 

Mukherjee et al. found that not all faults in a micro-architectural structure affect the final outcome of a program. As 
a result, an estimate based only on raw device fault rates will be pessimistic, leading architects to over-design their 
processor's fault-handling features. For example, a single-bit fault in a branch predictor will not affect the sequence 
or results of any committed instructions. They call the probability that a fault in a processor structure will result in 
a visible error in the final output of a program that structure's architectural vulnerability factor (AVF). Because 
a single-bit fault in a branch predictor does not affect the outcome of any program, the branch 
predictor's AVF is 0%. In contrast, a single-bit fault in the committed program counter will cause the wrong 
instructions to be executed, almost certainly affecting the program's result. Hence, the AVF for the committed 
program counter is effectively 100%. The overall error rate of a micro-architectural structure is the product of its 
raw fault rate and its AVF. By summing the contributions of all on-chip structures, a processor architect can map 
the raw fault rate (dictated by process and circuit issues) to an overall processor error rate, and thus determine 
whether the design meets its error rate goals (set according to the target market). 
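
The mapping from raw fault rates to an overall processor error rate can be sketched as follows. The AVFs of the branch predictor (0%) and the committed program counter (100%) come from the text; the raw FIT values and the instruction-queue entry are hypothetical numbers chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Structure:
    name: str
    raw_fit: float  # raw fault rate of the structure in FIT (hypothetical values below)
    avf: float      # architectural vulnerability factor, between 0.0 and 1.0


def processor_error_rate(structures: List[Structure]) -> float:
    """Sum raw_fit * avf over all structures to get the overall error rate in FIT."""
    return sum(s.raw_fit * s.avf for s in structures)


chip = [
    Structure("branch predictor", raw_fit=500.0, avf=0.0),          # AVF from the text
    Structure("committed program counter", raw_fit=2.0, avf=1.0),   # AVF from the text
    Structure("instruction queue", raw_fit=100.0, avf=0.25),        # hypothetical AVF
]

total_fit = processor_error_rate(chip)
print(f"overall error rate: {total_fit} FIT")
```

In this illustration the branch predictor has the largest raw fault rate but contributes nothing to the total, which is why an estimate based only on raw rates would be pessimistic.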


ACE Analysis 

The AVF of a processor structure is defined as the probability that a fault in that structure will result in a visible 
error in the final output of a program [6]. A bit in which a fault will result in incorrect execution is said to be 
necessary for architecturally correct execution (ACE); these bits are termed ACE bits. All other bits are Un-ACE bits. 
An individual bit may be ACE for a fraction of the overall execution cycles and Un-ACE for the rest. Therefore, the 
AVF of a single bit can be defined as the fraction of cycles that the bit is ACE. For a hardware structure $H$ with size 
$B_H$ (in bits), its AVF over a period of $N$ cycles can be expressed as follows [6]: 

$$\mathrm{AVF}_H = \frac{\sum_{n=1}^{N} (\text{ACE bits in } H \text{ at cycle } n)}{B_H \times N}$$


For example, if a storage cell contains ACE bits for a million cycles out of an execution of ten million cycles, then 
the AVF for that cell is 10%. The average AVF of an entire processor can be computed as the weighted average of 
the AVFs of its structures for systems of reasonable size [5]. 
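
As a small illustration of this definition, the following sketch computes the AVF of a structure from a per-cycle count of resident ACE bits. The function name and the input format are assumptions made for the example; the storage-cell case from the text is scaled down to 1,000 ACE cycles out of 10,000.

```python
def avf_from_ace_counts(ace_bits_per_cycle, structure_bits):
    """AVF = (sum over cycles of ACE bits resident) / (structure size in bits * cycles)."""
    n_cycles = len(ace_bits_per_cycle)
    if n_cycles == 0 or structure_bits <= 0:
        raise ValueError("need at least one cycle and a positive structure size")
    return sum(ace_bits_per_cycle) / (structure_bits * n_cycles)


# A one-bit storage cell that holds ACE state for 1,000 of 10,000 cycles.
cell_trace = [1] * 1_000 + [0] * 9_000
cell_avf = avf_from_ace_counts(cell_trace, structure_bits=1)
print(f"AVF of the cell: {cell_avf:.0%}")
```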

Given this equation for AVF, we need to determine which bits are ACE and which are Un-ACE. However, 
ACE bits are hard to identify directly. Because we want a conservative (upper-bound) AVF estimate, we first assume that 
all bits are ACE bits unless we can show otherwise. We then identify as many sources of Un-ACE bits as we can. 
Mukherjee et al. divided the sources of Un-ACE bits into two general categories: micro-architectural Un-ACE bits and 
architectural Un-ACE bits. The processor state bits that cannot influence the committed instruction path are called 
micro-architectural Un-ACE bits. They can arise from the following four situations: idle or invalid state, 
mis-speculated state, predictor structures, and ex-ACE state. Architectural Un-ACE bits are those that affect 
correct-path instruction execution, but only in ways that do not change the output of the system. The five sources of 
architectural Un-ACE bits are NOP instructions, performance-enhancing instructions, predicated-false instructions, 
dynamically dead instructions, and logical masking. 

Mukherjee et al. introduced lifetime analysis to compute the AVF of a processor's instruction queue and execution 
units [6]. Lifetime analysis involves dividing up a bit's lifetime during a program execution into ACE and Un-ACE 
components. The AVF is the fraction of the bit's lifetime during which the bit contained ACE state. To compute the 
AVF, we use Mukherjee et al.'s conservative assumption that the entire lifetime is ACE and then systematically 
prove which portions of the bit's lifetime are Un-ACE. Table 1 shows a detailed classification of lifetimes into ACE 
and Un-ACE components [7]. By definition, idle, read-to-write, and write-to-write are Un-ACE. Whether fill-to-read 
and write-to-read are Un-ACE depends on the read itself. For example, if the read is dynamically dead (its value 
will never be used in the future and, therefore, will not affect the final outcome of a program), then the write-to-read 
time is Un-ACE. Other examples of Un-ACE reads include those on the wrong path or those falsely predicated. 
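
The lifetime classification can be sketched for a single line of a write-through data cache, using the interval names of Table 1. This is only an illustration: the event-trace format is hypothetical, dynamically dead reads are not modelled, and the time after a final evict is treated as idle, which is an assumption rather than something the table states.

```python
# Interval labels for a write-through data cache, taken from Table 1.
ACE_INTERVALS = {"fill-to-read", "read-to-read", "write-to-read"}
UNKNOWN_INTERVALS = {"fill-to-end", "read-to-end", "write-to-end"}
# Every other interval (idle, fill-to-write, read-to-evict, ...) is Un-ACE.


def classify_lifetime(events, end_cycle):
    """Split the cycles [0, end_cycle) into ACE / Un-ACE / Unknown totals.

    events: list of (cycle, kind) sorted by cycle, with kind in
            {"fill", "read", "write", "evict"} (hypothetical trace format).
    """
    totals = {"ACE": 0, "Un-ACE": 0, "Unknown": 0}
    prev_cycle, prev_kind = 0, None
    for cycle, kind in list(events) + [(end_cycle, "end")]:
        if prev_kind is None or (prev_kind == "evict" and kind == "end"):
            label = "idle"  # assumption: the line holds no valid data here
        else:
            label = f"{prev_kind}-to-{kind}"
        if label in ACE_INTERVALS:
            totals["ACE"] += cycle - prev_cycle
        elif label in UNKNOWN_INTERVALS:
            totals["Unknown"] += cycle - prev_cycle
        else:
            totals["Un-ACE"] += cycle - prev_cycle
        prev_cycle, prev_kind = cycle, kind
    return totals


trace = [(10, "fill"), (30, "read"), (50, "read"), (60, "write"), (90, "evict")]
totals = classify_lifetime(trace, end_cycle=100)
# Conservative (upper-bound) AVF: count Unknown lifetime as ACE.
upper_bound_avf = (totals["ACE"] + totals["Unknown"]) / 100
print(totals, upper_bound_avf)
```

Counting the Unknown component as ACE keeps the estimate an upper bound, in line with the conservative assumption described above.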
Computing AVF

According to prior studies, the following models can be used to calculate the AVF: (1) statistical fault injection, (2) 
the analysis model, and (3) the performance model. Section 3.1 describes statistical fault injection. Section 3.2 
describes how to compute the AVF using Little's Law. Section 3.3 describes how to compute the AVF using a 
performance model. Section 3.4 compares the three models. 


TABLE 1 CLASSIFICATION OF LIFETIMES INTO NON-OVERLAPPING ACE OR UN-ACE COMPONENTS. 
DYNAMICALLY DEAD READS OR WRITES (NOT SHOWN EXPLICITLY) CONVERT ACE INTO UN-ACE COMPONENTS 

| Processor Structure | ACE lifetimes | Un-ACE lifetimes | Unknown lifetimes |
|---|---|---|---|
| Write-through data cache | fill-to-read, read-to-read, write-to-read | idle, fill-to-write, fill-to-evict, read-to-write, read-to-evict, write-to-write, write-to-evict, evict-to-fill | fill-to-end, read-to-end, write-to-end |
| Write-back data cache | fill-to-read, read-to-read, write-to-read, write-to-evict, write-to-end; some of the Un-ACE components can be conditionally ACE (see prose) | idle, fill-to-write, fill-to-evict, read-to-write, read-to-evict, write-to-write, evict-to-fill | fill-to-end, read-to-end |
| Data Translation Buffer | fill-to-read, read-to-read | idle, read-to-evict, evict-to-fill | fill-to-end, read-to-end |
| Store Buffer | fill-to-read, fill-to-evict, fill-to-end, read-to-read, read-to-evict, read-to-end | idle, evict-to-fill | none |


Statistical Fault Injection 

Many studies have injected faults into hardware RTL models [3, 8, 10] and demonstrated AVFs of 1%-10% 
for latches [8]. For example, Wang injected faults into an RTL model of the Alpha 21264 processor and 
reported AVFs of less than 10% for latches [8]. Wu et al. developed a VHDL-based fault injection tool that 
is implemented using simulation [9]. The biggest advantage of an RTL model is that it contains all the hardware 
structures necessary to create a processor. However, statistical fault injection requires an RTL model, which 
is generally not available during the architectural exploration phase of a microprocessor design project, and it 
requires simulating a large number of fault cases to provide adequate statistical significance. 
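
The principle of statistical fault injection can be illustrated on a toy model. The sketch below is not an RTL simulation: `run_program` is a hypothetical stand-in for a processor model, in which an 8-bit register is written every 10 cycles and only its low 4 bits are read 3 cycles later. Single bit flips are injected at random (cycle, bit) positions, and the AVF is estimated as the fraction of injections that change the program output, with a normal-approximation 95% confidence interval.

```python
import math
import random

N_CYCLES = 100
REG_BITS = 8


def run_program(fault=None):
    """Toy model: fault = (cycle, bit) flips one register bit in that cycle."""
    reg, output = 0, 0
    for cycle in range(N_CYCLES):
        if cycle % 10 == 0:
            reg = (cycle * 7 + 3) & 0xFF        # write the register
        if fault is not None and fault[0] == cycle:
            reg ^= 1 << fault[1]                # inject the bit flip
        if cycle % 10 == 3:
            output += reg & 0x0F                # read: only the low 4 bits are used
    return output


def estimate_avf(n_injections, seed=0):
    """Monte Carlo AVF estimate and the half-width of its 95% confidence interval."""
    rng = random.Random(seed)
    golden = run_program()
    failures = 0
    for _ in range(n_injections):
        fault = (rng.randrange(N_CYCLES), rng.randrange(REG_BITS))
        if run_program(fault) != golden:
            failures += 1
    p = failures / n_injections
    return p, 1.96 * math.sqrt(p * (1 - p) / n_injections)


def exhaustive_avf():
    """Reference value: inject every possible (cycle, bit) fault once."""
    golden = run_program()
    bad = sum(run_program((c, b)) != golden
              for c in range(N_CYCLES) for b in range(REG_BITS))
    return bad / (N_CYCLES * REG_BITS)


estimate, half_width = estimate_avf(2000)
print(f"estimated AVF = {estimate:.3f} +/- {half_width:.3f}, exhaustive = {exhaustive_avf():.3f}")
```

The confidence interval shrinks only with the square root of the number of injections, which is one reason the experiment time on a full RTL model becomes long.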


Analysis Model 

Little's Law [11], used here as the analysis model, can be written as $N = B \times L$, where $N$ is the average number of 
bits in a box, $B$ is the average bandwidth per cycle into the box, and $L$ is the average latency of an individual bit through 
the box. Applying this to ACE bits, we get the average number of ACE bits in a box as the product of the average 
bandwidth of ACE bits into the box ($B_{ace}$) and the average residence cycles of an ACE bit in the box ($L_{ace}$). Thus, 
we can express the AVF of a structure as: 

$$\mathrm{AVF}_H = \frac{B_{ace} \times L_{ace}}{\text{total number of bits in } H}$$


In many cases, it is possible to compute the bandwidth of ACE bits into a structure and the average residence 
cycles of ACE instructions using hardware performance counters, allowing AVF estimation without a simulation 
model. Using Little's Law, we can compute the average number of ACE bits resident in a structure [11, 12] and, 
from it, the AVF of the structure. This approach is usually used in early design, when neither an RTL model nor a 
performance model is available. 
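
A short numerical sketch of the Little's Law estimate follows. The instruction-queue parameters (64 entries of 32 bits, 1.5 ACE instructions entering per cycle, 8 cycles of residence) are hypothetical values of the kind that performance counters might provide.

```python
def avf_littles_law(ace_bits_per_cycle, ace_residence_cycles, structure_bits):
    """AVF = (B_ace * L_ace) / total bits, with N = B * L from Little's Law."""
    avg_ace_bits = ace_bits_per_cycle * ace_residence_cycles
    if avg_ace_bits > structure_bits:
        raise ValueError("inputs imply more ACE bits than the structure can hold")
    return avg_ace_bits / structure_bits


# Hypothetical instruction queue: 64 entries x 32 bits.
entry_bits = 32
iq_bits = 64 * entry_bits
b_ace = 1.5 * entry_bits   # ACE bits entering the queue per cycle
l_ace = 8                  # average residence of an ACE bit, in cycles

iq_avf = avf_littles_law(b_ace, l_ace, iq_bits)
print(f"instruction queue AVF: {iq_avf:.4f}")
```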


Performance Model 

The main idea of the performance model is to determine which of the bits flowing through the machine are ACE and 
which are Un-ACE. From the lifetime analysis, we know that the challenge of the performance model is determining 
the Un-ACE parts. To compute the AVF of a structure using the equation in Section 2.2, we need the following 
information: 

■ the sum of the residence cycles of all ACE bits of the objects resident in the structure during the execution of 
the program, 

■ the total execution cycles for which we observe the ACE bits' residence time, 

■ the total number of bits in the hardware structure. 

Using a performance model, we can compute all of the information above. We assume that objects carry instruction 
information along the pipeline. The AVF algorithm can be divided into three steps. First, as an instruction 
flows through different structures in the pipeline, we record the residence time of the instruction in each structure. 
Then, before the instruction disappears from the machine, either via a commit or via a squash, we update the 
structures it flowed through with a variety of information, such as the residence cycles and whether the instruction 
committed. Second, if the instruction commits, we put the instruction in a post-commit analysis window 
to determine whether it is dynamically dead or whether any of its bits are logically masked. Third, at 
the end of the simulation, we use the information captured in the first two steps to compute the AVF of each 
structure. 
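
The three steps can be sketched on a pre-recorded instruction trace. This is a simplified illustration, not the algorithm of any particular simulator: the `Instr` record, the trace, and the window size are hypothetical; only first-level dynamically dead instructions are detected; logical masking is ignored; and every bit of an ACE instruction is counted as ACE.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Instr:
    dest: Optional[str]          # destination register, or None
    srcs: Tuple[str, ...]        # source registers
    committed: bool              # False if the instruction was squashed
    residence: Dict[str, int]    # step 1: structure name -> cycles resident


def is_dynamically_dead(committed: List[Instr], i: int, window: int) -> bool:
    """Step 2: dead if the destination is overwritten before it is read."""
    dest = committed[i].dest
    if dest is None:
        return False
    for later in committed[i + 1 : i + 1 + window]:
        if dest in later.srcs:
            return False         # value is read: the instruction is ACE
        if later.dest == dest:
            return True          # value is overwritten without being read
    return False                 # unresolved in the window: conservatively ACE


def compute_avf(trace, total_cycles, structure_bits, bits_per_instr, window=4):
    """Step 3: AVF per structure = ACE bit-cycles / (structure bits * cycles)."""
    committed = [ins for ins in trace if ins.committed]   # squashed work is Un-ACE
    ace_cycles = {name: 0 for name in structure_bits}
    for i, ins in enumerate(committed):
        if is_dynamically_dead(committed, i, window):
            continue
        for name, cycles in ins.residence.items():
            ace_cycles[name] += cycles
    return {name: ace_cycles[name] * bits_per_instr / (structure_bits[name] * total_cycles)
            for name in structure_bits}


trace = [
    Instr("r1", (), True, {"iq": 5}),
    Instr("r2", (), True, {"iq": 4}),        # dead: r2 is overwritten before any read
    Instr("r3", ("r1",), True, {"iq": 6}),
    Instr("r9", ("r3",), False, {"iq": 3}),  # squashed (wrong path)
    Instr("r2", ("r3",), True, {"iq": 5}),
    Instr(None, ("r2",), True, {"iq": 2}),   # e.g. a store that consumes r2
]
avf = compute_avf(trace, total_cycles=20, structure_bits={"iq": 128}, bits_per_instr=32)
print(avf)
```

A larger analysis window resolves more instructions as dynamically dead and so lowers the estimate, at the cost of more bookkeeping in the simulator.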

Researchers have extensively studied AVF evaluation methods at the architecture level and have put forward 
several assessment tools [2, 13, 16, 17]. Xin Fu et al. proposed Sim-SODA (Software Dependability Analysis), 
a unified framework for estimating microprocessor reliability in the presence of soft errors at the architectural level 
[13]. They used the COOLDOWN and Hamming-distance-one mechanisms introduced in [14] to compute AVF accurately. 
The Sim-SODA framework includes AVF models for the instruction queue, register file, cache, TLB, and load/store 
queue. Yu Cheng et al. proposed a Hybrid AVF Evaluation Strategy (HAES), which combines memory access 
analysis and instruction identification for AVF evaluation [18]. HAES was then integrated into a general 
simulator to implement an improved AVF evaluation framework. Compared with other AVF evaluation 
tools, the AVF computed using this framework is 22.6% lower on average. 

The Comparison 

In this section, we compare the three models above in terms of their main features, advantages, and disadvantages; the 
details are shown in Table 2. 

AVF Extension and Research 

Many studies of AVF computation have been carried out. Building on the current methods, 
researchers have extended AVF computation in new directions, for example the online method based on 
indication bits [15] and the occupancy-based online AVF computing method [23]. The occupancy-based 
method can efficiently compute the AVF of different structures while the program is running. AVFs 
vary across application types and also change over time [19, 22]. 

Sridharan et al. also proposed the concept of the Program Vulnerability Factor (PVF) [20, 21]. The PVF metric gives insight 
into the vulnerability of a software resource to hardware faults in a micro-architecture-independent way, and can 
be used to make judgments about the relative reliability of different programs. This method could enable the 
development of reliability techniques at the compiler or even the programming language level. To determine whether 
PVF is a good predictor of relative vulnerability, Sridharan et al. measured the PVF of each program's Architectural 
Integer Register File (ARF) and compared these values to the AVF of the Integer Physical Register File (PRF). 
Experimental results show that the PVF and AVF values have a correlation coefficient of 0.98 [20]. In addition, whereas most 
prior studies focused on single-threaded micro-architectures, Tan et al. extended AVF evaluation methods 
to multi-threaded micro-architectures and proposed corresponding soft error protection measures [24]. 

Future Trends 

In this paper, we have summarized the current research methods. Future research is likely to 
concentrate on the following aspects: 

■ Most work currently considers only single-threaded applications, yet AVF varies with the system 
workload and the operating system, and also changes with the system state. Appropriate methods and 
techniques for multi-threaded systems will therefore be a focus of study. Some researchers have already 
conducted studies in this regard [25, 26, 27]. 

■ Current AVF research is still concentrated on the architecture and micro-architecture levels; 
future research will expand to the whole-system level and the program level. 

■ Most existing research concerns transient faults, but intermittent failures are also a leading 
cause of system failure. Some research on intermittent failures has already been done [28, 29]. 

■ Current assessment methods are mainly used in early microprocessor design; in the future they could be 
applied in other areas. 


TABLE 2 THE COMPARISON AMONGST THE THREE MODELS 

| Model | Main feature | Advantage | Disadvantage |
|---|---|---|---|
| Statistical fault injection [3, 8, 9, 10] | Injects faults into the processor model and analyzes the impact on the result of the program | Faults can be injected into any state unit; high precision | Long experiment time; the AVF of different structures can only be analyzed late in the design |
| Little's Law [11, 12] | Computes the production and propagation of faults based on numerical analysis | Establishes the relationship between soft errors and system MTBF; analyzes the contribution of different structures to system reliability | It is complex to analyze and compute the AVF of all structures |
| Performance model [2, 13, 14, 16, 17, 18] | Analyzes ACE and Un-ACE bits | The AVF values of different structures are computed at an early stage of design, which makes it easy to guide the reliability design | It is complex to determine which bits are ACE and which are Un-ACE |